Classifying text with the Naive Bayes classifier (NBC) requires prior pre-processing and feature
extraction. In general, pre-processing removes unnecessary words, while feature extraction turns
the remaining words into features. This paper focuses on feature extraction, in which word counting
and searching are performed with word2vec and frequency weighting is performed with term
frequency-inverse document frequency (TF-IDF). Words are classified from 1734 tweets, each treated
as a document when weighting frequencies with TF-IDF: for words that appear often in the tweets the
TF-IDF value decreases, and vice versa. Once the weight of each word in the tweets is obtained,
classification is carried out using Naive Bayes on the 1734 tweets, yielding an accuracy of 88.8%
for the slack word category and 78.79% for the verb category. It can be concluded that words
available on Twitter can be classified into slack words and verbs with a fairly good level of
accuracy, reflecting the habits of Twitter users.
1. INTRODUCTION

The Naive Bayes method applies probability and statistics to predict future outcomes from past
experience [1]. Its defining characteristic is a strong (naive) assumption that each condition or
event is independent of the others [2]. Naive Bayes has been used in several data mining tasks, for
example to classify image data and numerical data from several diseases [3]. It has also been
applied to classify the behavior of web users, with the aim of obtaining optimal word segmentation
results [4]. Many Naive Bayes applications thus classify numerical, image, and web data, including
data gathered by crawling [5].

Classification is a method of using data to develop a new computational model in a certain
area [6],[7]. The classification procedure employs a precise technique that differs from model to
model, and the closer the accuracy is to 100%, the better the model [8]. High accuracy signifies
that the final model, built from training and testing data, produces good outcomes. Naive Bayes has
been applied to detect hate speech on Twitter, with the expectation that the method can learn from
earlier Twitter data and achieve good accuracy in testing; a system applying the Naive Bayes
classifier reached 93% [9]. AlSalman [10] likewise studied Naive Bayes for sentiment analysis of
Arabic social media content, collecting opinions on Twitter from different fields such as hobbies,
activities, and work. The experiments showed that the proposed approach is useful and worth
continuing: comparisons indicated that it outperforms existing work in the field, improving
accuracy by 0.3%. Across various studies, Naive Bayes is often used for sentiment analysis on
social media [11]. Social media is now very close to its users, who are accustomed to creating and
sharing content with one another [12]-[15]. Users spend about 142 minutes a day on social
media [16], a figure reported to have risen from an initial 100 minutes [17]. Even though people
around the world spend a large part of their days on these platforms, it is difficult to determine
whether they are beneficial or detrimental to their users.

This is related to the research by Lubis et al. [18], who proposed a framework for social media
users in which the words disclosed on social media serve as keywords that reflect users' habits.
The first step is to review current postings in order to obtain data that are more specific and
precise than those collected from netizens at large [19]. These exact keywords can then be used by
a social media system to detect the profile of a certain user in any online domain that a search
engine can access. This suggests that the behavior of social media users can be classified with the
Naive Bayes method, using words and keywords as training data [20]. Matching words and obtaining
their frequencies, however, requires a feature extraction stage. This study therefore compares
several feature extraction techniques to obtain an optimal classification process.

Because many feature extraction techniques are available, this research focuses on feature
extraction for word classification on social media using Naive Bayes. In line with current
developments in data science, the feature extraction in this paper uses term frequency-inverse
document frequency (TF-IDF) and Word2Vec. Word2Vec predicts a word given its surrounding context;
once the model has been built, the appropriate operations on the context vectors are used to
classify the words in a new tweet [21].


2. MATERIAL AND METHOD 
2.1. Data Mining in Social Network 
Data mining is a term that usually refers to knowledge discovery in databases. It is a process that
uses mathematical, statistical, artificial intelligence, and machine learning methods to extract
and identify useful information and knowledge from large databases [22]. Data mining is also
described as the process of finding meaningful patterns, trends, and relationships [23]. Before
carrying out the data mining process, it is better to know in advance what data mining can do, so
that the work matches what is needed and produces something previously unknown, new, and useful for
its users [24]. In principle, data mining has several tasks, and the process must be carried out
correctly. There are 2 types of data mining tasks, namely [25]:
a. Predictive
Estimating the value of a certain attribute based on the values of other attributes. The attribute
to be predicted is called the target or dependent variable, while the attributes used to make the
prediction are called independent variables.
b. Descriptive
Deriving patterns such as groups, trajectories, correlations, anomalies, and trends that summarize
the underlying relationships in the data. Descriptive data mining tasks are often exploratory and
frequently require post-processing techniques to explain and validate the results.


2.2. Naive Bayes Algorithm 

Naive Bayes is one of the methods of classification. It is based on the theorem proposed by Thomas
Bayes, a scientist from England. The goal of Naive Bayes is to predict future probabilities from
previous experience [26]. The main characteristic of the Naive Bayes Classifier is a very strong
(naive) assumption that each condition is independent of the others. Compared with other classifier
models, the Naive Bayes Classifier performs quite well. One of the benefits of this method is that
it requires only a small amount of training data to estimate the parameters used in classification.
Because the variables are assumed to be independent, only the variance of each variable within a
class is needed to decide the classification, not the whole covariance matrix [27].

Naive Bayes has two stages: the training stage and the classification stage. In the training stage,
the sample documents are analyzed and a vocabulary is selected from the words that appear in them,
since these words represent the documents. The next step is to determine the probability of each
category based on the sample documents. Naive Bayes builds a probabilistic model from the labeled
term-document matrix. A document is classified by determining the category c of the words in the
document, which is computed using (1) [28]:


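$$c^{*} = \arg\max_{c_i \in C} P(c_i \mid d_j) = \arg\max_{c_i \in C} \prod_{k} p(w_{kj} \mid c_i) \times p(c_i) \tag{1}$$


where:

$c_i$ is a candidate category from the set of categories $C$

$w_{kj}$ is the k-th feature or word of the document / tweet $d_j$

$d_j$ is the document whose category is to be found

The value of $p(w_{kj} \mid c_i)$ is known from the available training data.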
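
In practice, the product in (1) is often evaluated in log space to avoid numerical underflow when a tweet contains many words. This is a standard reformulation and is not stated in the source; because the logarithm is monotonic, the selected category does not change:

```latex
c^{*} = \arg\max_{c_i \in C} \left[ \log p(c_i) + \sum_{k} \log p(w_{kj} \mid c_i) \right]
```

A zero estimate for any $p(w_{kj} \mid c_i)$ makes the logarithm undefined, so this form is normally paired with a smoothing scheme such as add-one (Laplace) smoothing. Smoothing is an assumption here, not part of the described method.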


3. GENERAL ARCHITECTURE 

Good research requires a clear research flow that describes the stages carried out and helps ensure
that the research runs as expected. The flow of this research is drawn as a general architecture,
as depicted in Figure 1:


[Figure 1: flow diagram. Step I. Crawling: Twitter API. Step II. Text pre-processing: Word2Vec. Step III. Process and classification: Word2Vec, Naive Bayes Classification.]

Figure 1. General Architecture


The general architecture in Figure 1 is explained as follows:

a) Step 1. Tweets to be classified are collected by crawling through the Twitter API.

b) Step 2. Text preprocessing is followed by feature extraction, in which words are counted with
the help of Word2Vec and their frequencies are then calculated with TF-IDF. This combination is
the contribution of this paper.

c) Step 3. The classification process applies NBC and produces the word classification and its
accuracy. The NBC procedure consists of the following steps:

i) Decomposing the data into training and testing sets

ii) Reading the training data

iii) For numeric data, calculating the count and probability as follows:

- If a parameter is numeric, its mean and standard deviation are calculated. The mean is computed
as in (2):

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2}$$

where

$\mu$ : mean

$x_i$ : the i-th value of x

n : total samples

The standard deviation is computed as in (3):

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}} \tag{3}$$

where

$\sigma$ : standard deviation

$x_i$ : the i-th value of x

$\mu$ : mean

n : total samples


- For categorical data, the probability is obtained by dividing the number of matching records in
a category by the total number of records in that category.

iv) Obtaining the value for the word classification

v) Calculating the accuracy with (4):

$$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Amount of test data}} \times 100\% \tag{4}$$
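
The quantities in (2), (3), and (4) can be sketched in a few lines of Python. The function names, the numeric feature values, and the label lists below are hypothetical and serve only to illustrate the arithmetic; they are not data from the study.

```python
import math

def mean(values):
    """Equation (2): arithmetic mean of the samples."""
    return sum(values) / len(values)

def sample_std(values):
    """Equation (3): standard deviation with the n - 1 denominator."""
    mu = mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / (len(values) - 1))

def accuracy(predicted, actual):
    """Equation (4): percentage of test items classified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) * 100

# Hypothetical numeric feature values, for illustration only.
feature_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = mean(feature_values)
sigma = sample_std(feature_values)

# Hypothetical predicted and true labels for four test items.
predicted = ["verb", "slack", "slack", "verb"]
actual = ["verb", "slack", "verb", "verb"]
acc = accuracy(predicted, actual)
```

With these illustrative inputs, three of the four labels match, so the accuracy is 75%.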


4. RESULT AND DISCUSSION 

Classification, as a data mining technique, can be performed using the Naive Bayes algorithm. Naive
Bayes is also often used in sentiment analysis research to obtain accuracy figures, patterns, human
behavior, and other information available in cloud networks. Classification with Naive Bayes cannot
be separated from training and testing data, so this study collects and uses data taken from social
media. Feature extraction is then performed on the crawled data to facilitate the classification of
words on social media. Several feature extraction techniques are tested to optimize word
classification with Naive Bayes on social media.

Terms, which can be a sentence, word, or other indexing unit in a tweet that serves to establish
the context, are things to consider when looking for information in a collection of documents or
tweets. Because each word has a different degree of relevance in the tweets, an indicator, namely
the term weight, is assigned to each word. Word2vec is used to count and search for the words, and
the results are shown in Figure 2:


[Figure 2: chart titled "Result of word count with Word2Vec"; legend: wkwkw, happy, Selamat pagi, makasih, Love, Anniversary.]

Figure 2. Results of Processing with Word2Vec


Int J Artif Intell, Vol. 11, No. 3, September 2022: 1041-1048, ISSN: 2252-8938


Preprocessing and feature extraction are the next steps in the classification process and are used
to find meaning in the tweets that will be trained or tested. This procedure is necessary because
the test documents are paragraphs containing tags that obscure their content. Before preprocessing,
it was difficult to understand the contents of the test text. Because features can be affected by
preprocessing, the text must first be identified.

Tokenization is performed first to separate the character stream into tokens or words. Tokenization
is difficult for computer programs because several different characters can act as token
separators. Text identification is then carried out to detect the patterns in the text that will
define the categories used as training data.
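
A minimal tokenization sketch follows, assuming that URLs, @mentions, and HTML tags are noise to be dropped before splitting into words. The regular expressions, the function name `tokenize`, and the sample tweet are illustrative assumptions, not the authors' actual rules.

```python
import html
import re

# Patterns are illustrative assumptions, not the authors' actual rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(tweet):
    """Lowercase a tweet, strip URLs, mentions and HTML tags, then split into word tokens."""
    text = html.unescape(tweet).lower()
    for pattern in (URL_RE, MENTION_RE, TAG_RE):
        text = pattern.sub(" ", text)
    return TOKEN_RE.findall(text)

# Made-up example tweet.
tokens = tokenize("@friend Selamat pagi! <b>Happy</b> anniversary wkwkw https://example.com/x")
```

Punctuation acts as a separator here because only runs of letters and digits are kept as tokens. A real pipeline might also need to handle emoji, hashtags, and multi-word expressions such as "Selamat pagi".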

Next, frequency calculations are performed with TF-IDF, the term weighting method commonly used as
a baseline when comparing new weighting methods. The term weight of a document is calculated by
multiplying the term frequency (TF) by the inverse document frequency (IDF). The values used to
compute the weights with TF-IDF are listed in Table 1.

Table 1 shows the TF-IDF calculations for words that frequently appear in the tweets, such as the
term "wkwk," which is slang for laughing at something amusing, and the word happy, which the study
treats as a verb expressing joy. The calculation relies on frequency-based weighting: the more
often a word appears, the smaller its IDF value. NBC is then used to find terms and assess the
performance of documents based on the tweets that appear. The first step is to calculate the
frequency of occurrence of each word in the document, where more frequent repetition gives the word
a larger TF value.


Table 1. Terms of Optimization Word2Vec and TF-IDF

Word (t)        TF     IDF
happy           39     log(112/39) = 0.4578818967
makasih         17     log(112/17) = 0.8188854146
Anniversary     4      log(112/4) = 1.447158031
wkwkwk          112    log(112/112) = 0
Love            9      log(112/9) = 1.09482038
Selamat pagi    22     log(112/22) = 0.7067177823
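
The IDF column of Table 1 can be approximately reproduced with a base-10 logarithm. The base is an assumption, since the text does not state it. The sketch below uses 112 as the numerator and the tabulated counts as the denominator, exactly as written in the table, and forms the term weight as the product of TF and IDF.

```python
import math

# Counts copied from Table 1; 112 is the numerator used in the table's IDF column.
N = 112
tf = {
    "happy": 39,
    "makasih": 17,
    "Anniversary": 4,
    "wkwkwk": 112,
    "Love": 9,
    "Selamat pagi": 22,
}

# Base-10 logarithm is an assumption; the source does not name the base.
idf = {word: math.log10(N / count) for word, count in tf.items()}

# Term weight as the product TF x IDF.
weight = {word: tf[word] * idf[word] for word in tf}

for word in tf:
    print(f"{word:<13} tf={tf[word]:>3}  idf={idf[word]:.4f}  weight={weight[word]:.2f}")
```

The most frequent term, wkwkwk, receives an IDF of 0, which illustrates why very common words end up with a low weight. Under the base-10 assumption, the computed IDF values agree with Table 1 to about three decimal places.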


The NBC method requires two stages in the word classification process. The first stage is training,
in which the sample documents are analyzed. The samples are social media data, namely tweets, and
the words that appear in the collection of sample documents, determined from people's habits on
social media, should represent as many documents as possible. The documents used for training serve
as a reference in the testing process, as shown in Table 2. The second stage is testing, in which
the training documents are used as the reference for classifying the test data.


Table 2. Data Decomposition 


Word (t) Training Testing 
happy 75% 25% 
makasih 75% 25% 
Anniversary 75% 25% 
wkwkwk 75% 25% 
Love 75% 25% 
Selamat pagi 75% 25% 
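
The 75%/25% decomposition in Table 2 could be produced as sketched below. The paper does not say how the split was drawn, so the seeded random shuffle, the function name `split_75_25`, and the placeholder tweet identifiers are assumptions for illustration.

```python
import random

def split_75_25(items, seed=0):
    """Shuffle a copy of the items and return (training, testing) in a 75/25 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers; 1734 matches the number of tweets reported in the paper.
tweets = [f"tweet_{i}" for i in range(1734)]
train, test = split_75_25(tweets)
```

A fixed seed keeps the split reproducible between runs. With 1734 items, this yields 1300 training items and 434 testing items.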


In this study, the data source is Twitter, and the tweets are organized into documents that serve
as a reference for how documents are classified. The targeted reference is document labeling based
on expert domains. The documents used are tweets from Twitter. Tweets are unstructured content
because elements such as mentions and HTML tags make the document hard to interpret. For accurate
classification, a structured document is needed so that it is easy to understand. The experimental
data consist of 1734 tweets from the @arfridho account.

In the previous stage, although the generated text pattern was analysed by applying stopwords, its
irregularity presents a challenge during identification. Identifying the text was difficult and
required careful examination. Because the patterns are irregular in their content arrangement, the
documents must be read one at a time during identification to grasp the existing patterns in the
text. The tags on the training documents are identified manually. The data are divided into two
groups: slack words and verbs. Term frequency can be used to determine characteristics; however,
the experimental data indicated that only 25% of the chosen terms occurred frequently, which has
little bearing on the categorization process.

The data in Figure 3 were obtained using TF-IDF. The document classification process then requires
the prior probability p(ci) of each category, which is the number of training documents labeled
with that category divided by the total number of training documents: for category x, the number of
documents in x divided by the total, and likewise for category y, as shown in Table 3. Of the 1734
tweets used, 1203 refer to the slack word category and 531 to the verb category. The accuracy
calculated with (4) is 78.79% for the verb category and 88.8% for the slack word category.


[Figure 3: line chart of TF and IDF values per word; x-axis "Word": happy, makasih, Anniversary, wkwkw, Love, Selamat pagi.]

Figure 3. Classification features


Table 3. Social Media Word Classifications by NBC

Category      Word Classification    Verb    Slack words
P(ci)                                0.50    0.50
P(wkj|ci)     happy                  15      24
              makasih                17      0
              Anniversary            2       2
              wkwkw                  0       112
              love                   9       0
              Selamat pagi           20      2
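
As a hedged illustration of how the counts in Table 3 could drive the decision rule in (1), the sketch below turns each count into a smoothed likelihood and picks the category with the highest log score. The add-one smoothing, the log-space scoring, the lowercase keys, and the function names are assumptions introduced here; the paper does not describe them.

```python
import math

# Word counts per category copied from Table 3 (keys lowercased for lookup).
counts = {
    "verb": {"happy": 15, "makasih": 17, "anniversary": 2,
             "wkwkw": 0, "love": 9, "selamat pagi": 20},
    "slack": {"happy": 24, "makasih": 0, "anniversary": 2,
              "wkwkw": 112, "love": 0, "selamat pagi": 2},
}
prior = {"verb": 0.50, "slack": 0.50}  # P(ci) from Table 3

def log_likelihood(word, category, alpha=1.0):
    """log p(word | category) with add-alpha smoothing (an assumption, not from the paper)."""
    total = sum(counts[category].values())
    vocab = len(counts[category])
    return math.log((counts[category][word] + alpha) / (total + alpha * vocab))

def classify(words):
    """Return the category maximizing log p(c) + sum of log p(w | c), as in equation (1)."""
    scores = {
        c: math.log(prior[c]) + sum(log_likelihood(w, c) for w in words)
        for c in counts
    }
    return max(scores, key=scores.get)

label_a = classify(["wkwkw"])
label_b = classify(["makasih", "love"])
```

With these counts, "wkwkw" is assigned to the slack word category, while "makasih" together with "love" is assigned to the verb category. Smoothing keeps the zero counts in the table from eliminating a category outright.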


5. CONCLUSION 

This paper concludes that words available on Twitter can be classified into slack words and verbs,
categories that reflect the habits of Twitter users. The word classification process begins with
data crawling through the Twitter API, followed by preprocessing and feature extraction. The point
of interest is the feature extraction process, which combines word2vec with TF-IDF. From the TF-IDF
calculation it can be deduced that words that appear frequently receive smaller values, while words
that appear less frequently receive larger values. Following the TF-IDF calculation, classification
is done using the Naive Bayes approach, which divides the words into two categories, namely the
slack word category and the verb category. The data consisted of 1734 tweets, of which 1203 refer
to the slack word category and 531 to the verb category, resulting in an accuracy of 88.8% for the
slack word category and 78.79% for the verb category. These test results show fairly good accuracy
in categorizing slack words and verbs.